iT邦幫忙

2023 iThome 鐵人賽

DAY 8
0

繼續聽課...

Design A WEB crawler

check requirement

  • Q: What's target about this crawler?
  • A: It's search engine.
  • Q: so what are the webs we need to crawler? entire web? or just a dew site
  • A: absolutely entire web
  • Q: How often crawled?
  • A: 1 per week
  • Q: Do we need to check the web that we have crawed to check if it's be updated?
  • A: right, you need.
  • Q: Do we need to store copy of every page as we go?Does that include images?
  • A: yes, we need stroe at least html
  • Q: What about dynamic content?
  • A: we can consider it later

Try to design by myself

At first, because webs amount is very large, and the depth of the website is not the first thing that we need to concern about. So we can use BFS algo to crawl the web

algo: BFS

We need to use muti service to crawl the web, we can use a queue service like kafka to help us mantain the taksk
And we need to have a list for the urls we need to start with it

init: list of urls
algo: BFS
queue service: kafka

We can use the nosql db to record all result of crawl ans s3 to save the copy of page.
and use hash function to restore the url and page

init: list of urls
algo: BFS
queue servie: kafka
database: noSql 
storage: S3
data schema:  url/update_time/...some data for seo/s3_key/hash_content/

course display

algo: BFS

Queue of URL's to crawl -> page downloader -> URL extraction -> URL filter/stoplist
distributed storage (key, value)

Queue of URL's to crawl -> linked list / can backup on service and use hash to backup on different bucket

Page downloader -> we can put the url back to the queue

edge case: no limit for-loop website, the page need to login, the link is dynamoic render

Need to improve

Consider the function of component first, not too fast to say which service we can use.

先維持這樣每天看一題system design順便刷刷題的模式好了
剛到新公司上班 一邊還要練這些...好想吐喔

Excel Sheet Column Title

Q: https://leetcode.com/problems/excel-sheet-column-title/description/

class Solution {
    public String convertToTitle(int columnNumber) {
        StringBuilder sb = new StringBuilder();
        while (columnNumber > 0) {
            columnNumber--;
            int num = columnNumber % 26;
            sb.insert(0, (char) ('A' + num));
            columnNumber /= 26;
        }
        return sb.toString();
    }
}

House Robber

Q: https://leetcode.com/problems/house-robber/description/

/**
    f(n) n house robber max value
    f(0) = nums[0];
    f(1) = Math.max(nums[0], nums[1]);
    f(n) = Math.max(f(n-1), f(n-2) + nums[n]);
*/

class Solution {
    public int rob(int[] nums) {
        if (nums.length == 1) {
            return nums[0];
        }
        int n = nums.length;
        int[] dp = new int[n];
        dp[0] = nums[0];
        dp[1] = Math.max(nums[0], nums[1]);
        for (int i = 2; i < n;i++) {
            dp[i] = Math.max(dp[i-1], dp[i-2] + nums[i]);
        }
        return dp[n-1]; 
    }
}

Longest Mountain in Array

Q: https://leetcode.com/problems/longest-mountain-in-array/description/

class Solution {
    public int longestMountain(int[] arr) {
        int n = arr.length;
        int leftMax[] = new int[n];
        int rightMax[] = new int[n];
        Arrays.fill(leftMax, 1);
        Arrays.fill(rightMax, 1);
        for(int i = 1;i < n; i++) {
            if (arr[i - 1] < arr[i]) {
                leftMax[i] = leftMax[i - 1] + 1;
            }
        }
        for (int i = n - 2;i >= 0;i--) {
            if (arr[i + 1] < arr[i]) {
                rightMax[i] = rightMax[i+1] + 1;
            }
        }
        int max = 0;
        for(int i = 0;i < n;i++) {
            if (leftMax[i] > 1 && rightMax[i] > 1) {
                max = Math.max(max, leftMax[i] + rightMax[i] - 1);
            }
        }
        return max;
    }
}

上一篇
09/07
下一篇
09/09
系列文
30天準備google面試30
圖片
  直播研討會
圖片
{{ item.channelVendor }} {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言